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AMENDMENTS TO THE CLAIMS: 

1 . (Previously presented) A software method of improving at least one of efficiency and 
speed in executing a linear algebra subroutine on a computer having a floating point unit 
(FPU) and a load/store unit (LSU) capable of overlapping loading data and processing said 
data by the FPU, said method comprising: 

for an execution code controlling operation of said linear algebra subroutine 
execution, overlapping by preloading data into floating point registers (FRegs) of said FPU, 
said overlapping causing data to arrive into said FRegs to be timely executed by the FPU 
operations of said linear algebra subroutine on said FPU. 

2. (Previously presented) The method of claim 1, wherein instructions are unrolled 
repeatedly until the data loading reaches a steady state in which a data loading exceeds a data 
consumption. 

3. (Original) The method of claim 1, wherein said linear algebra subroutine comprises a 
matrix multiplication operation. 

4. (Currently amended) The method of claim 1 , wherein said linear algebra subroutine 
comprises a subroutine equivalent to a LAPACK (Linear Algebra PACKage) subroutine^as- 
modified in accordance with claim 1 . 

5. (Original) The method of claim 4, wherein said LAPACK subroutine comprises a BLAS 
Level 3 LI cache kernel. 
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6. (Previously presented) An apparatus, comprising: 

a memory to store matrix data to be used for processing in a linear algebra program; 

an LI cache to receive data from said memory; 

a floating point unit (FPU) to perform said processing; and 

a load/store unit (LSU) to load data to be processed by said FPU, said LSU loading 
said data into a plurality of floating point registers (FRegs), 

wherein said data processing overlaps said data loading such that matrix data is 
preloaded into said FRegs from said LI cache prior to being required by said FPU. 

7. (Original) The apparatus of claim 6, wherein said preloading is achieved by unrolling a 
loading instruction so that a load occurs every cycle until a preload condition has been 
satisfied. 

8. (Original) The apparatus of claim 6, wherein said linear algebra program comprises a 
matrix multiplication operation. 

9. (Currently amended) The apparatus of claim 6, wherein said linear algebra program 
comprises a subroutine equivalent to a LAPACK (Linear Algebra PACKage) subroutineras- 
modified in accordance with claim 6 . 

10. (Previously presented) The apparatus of claim 9, wherein said subroutine comprises a 
BLAS Level 3 LI cache kernel. 

11. (Original) The apparatus of claim 6, further comprising: 

a compiler to generate an instruction for said preloading. 
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12. (Previously presented) A computer-readable medium tangibly embodying a program of 
machine-readable instructions executable by a digital processing apparatus to perform a 
method of improving at least one of speed and efficiency in executing a linear algebra 
subroutine on a computer having a floating point unit (FPU) and a load/store unit (LSU) 
capable of overlapping loading data and processing said data, said method comprising: 

for an execution code controlling operation of said linear algebra subroutine 
execution, overlapping by preloading data into a floating point registers (FRegs) of said FPU, 
said overlapping causing data from an LI cache to arrive into said FRegs, to be timely 
executed by FPU operations of said linear algebra subroutine on said FPU. 

13. (Previously presented) The computer-readable medium of claim 12, wherein a load 
instruction is unrolled repeatedly until the data loading reaches a steady state in which a data 
loading exceeds a data consumption. 

14. (Previously presented) The computer-readable medium of claim 12, wherein said linear 
algebra program comprises a matrix multiplication operation. 

15. (Currently amended) The computer-readable medium of claim 12, wherein said linear 
algebra program comprises a subroutine equivalent to a LAPACK (Linear Algebra 

16. (Previously presented) The computer-readable medium of claim 15, wherein said 
subroutine comprises a BLAS Level 3 LI cache kernel. 
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17. (Previously presented) A method of providing a service involving at least one of solving 
and applying a scientific/engineering problem, said method comprising at least one of: 

using a linear algebra software package that computes one or more matrix 
subroutines, wherein said linear algebra software package generates an execution code 
controlling a load/store unit loading data into a floating point registers (FRegs) for a floating 
point unit (FPU) performing a linear algebra subroutine execution, said FPU capable of 
overlapping loading data and performing said linear algebra subroutine processing, such that, 
for an execution code controlling operation of said FPU, said overlapping causes a preloading 
of data from an LI cache into said FRegs; 

providing a consultation for purpose of solving a scientific/engineering problem using 
said linear algebra software package; 

transmitting a result of said linear algebra software package on at least one of a 
network, a signal-bearing medium containing machine-readable data representing said result, 
and a printed version representing said result; and 

receiving a result of said linear algebra software package on at least one of a network, 
a signal-bearing medium containing machine-readable data representing said result, and a 
printed version representing said result. 

18. (Currently amended) The method of claim 17, wherein said linear algebra subroutine 
comprises a subroutine equivalent to a LAPACK (Linear Algebra PACKage) subroutineras- 
modified in accordance with claim 17 . 

19. (Previously presented) The method of claim 18, wherein said subroutine comprises a 
BLAS Level 3 LI cache kernel. 
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20. (Currently amended) The method of claim 1, wherein said FPU comprises said FRegs as 
interfaced with an LI cache, said interface having a loading penalty of n cycles, said 
preloading eliminating this n-cycle penalty. 
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